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Abstract 

The essential features of the in vitro refolding of myoglobin are 
expressed in a solvable physical model. Alpha helices are taken as the 
fundamental collective coordinates of the system, while the refolding 
is assumed to be mainly driven by solvent-induced hydrophobic forces. 
A quantitative model of these forces is developed and compared with 
experimental and theoretical results. The model is then tested by be- 
ing employed in a simulation scheme designed to mimic solvent effects. 
Realistic dynamic trajectories of myoglobin are shown as it folds from 
an extended conformation to a close approximation of the native state. 
Various suggestive features of the process are discussed. The tenets of 
the model are further tested by folding the single-chain plant protein 
leghemoglobin. 



*Work supported in part by the U.S. Department of Energy under 
Grant No. DOE-91ER40651 Task B. 



Solvent-induced organization: 
A physical model of folding myoglobin 

1. Introduction 
a. Synopsis 

It is generally assumed that the native conformation of a protein is given by the 
structure which yields the global minimum of the free energy t^l However, it is less 
clear how a protein reaches this minimum. Simple counting arguments t^' suggest that 
observed folding times, typically 1-100 seconds, are far too short to allow a global 
examination of all available states. (This puzzle is often referred to as the "Levinthal 
paradox"). Thus, there are likely to be physical principles which determine the way 
in which a protein folds. Knowledge of these principles should be of some utility in 
developing tools for predicting protein folding. 

A major stumbling block on the road to developing such tools is the general 
absence of computationally tractable models wherein a protein folds in a realistic 
fashion. One candidate, a model of the large-scale in vitro folding of myoglobin, is 
presented here. Two major assumptions about the folding of myoglobin are made 
in this work and are discussed in detail below. First, it is assumed that the alpha 
helices which comprise myoglobin are nascent at an early stage of folding, and that 
the interactions between these larger subunits determine the tertiary structure of this 
protein. The second major assumption is that solvent effects (the "hydrophobic" 
interaction) are largely responsible for this tertiary structure. 

A quantitative model of hydrophobic forces is developed below, using input from 
experimental data. It will be seen that the model also agrees with microscopic models 
of effective hydrophobic interactions. Molecules of the solvent water are included 
only implicitly in the calculation. This is in keeping with the reductionist philosophy 
employed here, wherein the complexity of the system is reduced by identifying and 
including only the important degrees of freedom in the system. Rather than trying to 
fold a protein as precisely as possible, the object here is to develop a tractable model, 
based upon sound physical principles, which quickly allows one to determine global 
folding patterns to "sketchbook" accuracy. Uncertainties due to reparameterization 
of the potential function are, accordingly, kept to a minimum. 
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More generic potential functions could, of course, be included in the present cal- 
culation to produce a more refined structure. Such a generic approach, however, runs 
the risk of losing contact with the physical principles which underly protein folding. 
This risk is lessened substantially if, rather than postulating an intricate potential 
function and, subsequently, determining its parameters by fitting to a large data set, 
one instead develops an explicit physical model for the relevant interaction. By using 
reaUstic interaction potentials, the fundamental laws which govern various aspects of 
protein folding can be extracted systematically. Additionally, by refining the model 
through the elimination of nonessential interactions, the computer time required to 
solve it is substantially reduced. An extensive search of configuration space then be- 
comes possible, and the refined model then provides an excellent laboratory for the 
testing and improvement of protein folding algorithms. 

Following the specification of the model, it will be solved using an approach based 
upon the Langevin equation. This formalism is particularly well-suited for approx- 
imating solvent effects, for it describes a system interacting with a medium which 
provides random thermal fiuctuations. When the Langevin equations are integrated, 
the result is a representative trajectory of the protein as it folds toward its most 
probable configuration, the state of minimal free energy. A typical trajectory will 
be shown as a time series of states leading to this minimum. The resultant state 
of minimum free energy forms a reasonable approximation to the native conforma- 
tion. The model is applied to two very distinct proteins, sperm-whale myoglobin and 
leghemoglobin. Despite the fact that these proteins differ greatly in sequence and 
biological origin, they both possess a similar topology (often called a "globin fold"). 
The existence of the globin fold and the topology of the two proteins is accurately 
predicted by the model. 
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b. Myoglobin - an overview. 

It has often been asserted that the tertiary structure of myoglobin is a direct 
consequence of attractive interactions between groups of hydrophobic residues on its 
constituent alpha helices. This assumption led to an early derivation of the structure 
of myoglobin by counting the number of dehydrated regions for various config- 
urations of a simple geometrical model. Subsequent work W -was based upon the 
assumption that helices pair at specific hydrophobic interaction sites with certain he- 
lix axis angles. This assumption reduces the allowed configuration space considerably. 
An initial set of 3 x 10^ structures is ultimately reduced to 20, one of which clearly 
resembles myoglobin. Important later work '^1 posits a specific form for the hydropho- 
bic interaction between amino acid residues, which are approximated as spheres of 
van der Waals radius. The force used is unphysically strong and long-ranged and is 
the same between all hydrophobic residues. A minimization scheme is used to find 
the configuration of minimum energy, which is similar to native myoglobin. 

Inspired by these successes, a physical model of myoglobin is developed below and 
then solved by a Langevin scheme designed to mimic solvent effects. The model is 
intended to be part of a larger philosophy which involves three stages of analysis. In 
the first stage, a tentative identification of regions of secondary structure is made. 
The second stage involves large-scale movement of the tentatively assigned secondary 
structures, which are taken as the fundamental degrees of freedom. The third stage 
is a final "polishing" of the structure, using a detailed model of protein dynamics 
(e.g., [^1). In principle, this final stage employs a complete set of realistic interactions 
and involves all degrees of freedom in the model. Due to the limitations of present 
computational architecture, only small perturbations on the structures generated by 
the large-scale global optimization of the second stage can, therefore, be completed. 
The first and third stages are hkely to be generally possible in the near future; the 
second stage is contemplated here. 

Intrinsic to this approach is the idea that the folding process can be described 
in terms of "effective" interactions between fundamental units which are larger than 
atoms but smaller than the entire protein [6^,66] ^^y. ^j^jg ^.Q^y occur is via the 

collision and coalescence of unstable microdomains, each of which folds and unfolds 
rapidly (the "diffusion-collision" model '''1). The existence of a stable folding inter- 
mediate is, therefore, not necessary for the present approach to be valid. Indeed, 
extensive experimental work on refolding myoglobin generally finds no evidence 
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of a such a folding intermediate at normal (25-27°C) temperatures. (At -6°C and 
50°C, however, folding intermediates in apomyoglobin have been reported I^^'^^l.) 

2. Methods. 

a. Quantitative model of "effective" hydrophobic forces. 

The hydrophobic interaction is certainly a major component of protein folding t^^l . 
Moreover, the above-described successes of models t^"^' for folding myoglobin suggest 
that it is the dominant force between alpha helices in this protein. A quantitative 
physical model for this force is now developed. 

At normal temperatures, the hydrophobic interaction is generally believed to be 
a solvent-induced entropic effect '^^1 . Hydrophobic forces are generated by the spatial 
variation of the solvent free energy and not the enthalpy alone (as with, say, van der 
Waals forces). These hydrophobic forces are here considered "effective" in the sense 
that explicit solvent molecules are not included in their calculation. 

It is widely assumed that the "hydrophobic effect" occurs because nonpolar solutes 
do not participate significantly in hydrogen bonding. The presence of these nonpolar 
solutes in a polar liquid such as water thus disturbs the local hydrogen bond structure 
and leads to an attractive effective interaction between the nonpolar portions of the 
solute molecules. The hydrophobic effect is usually taken to be proportional to the 
solvent-accessible surface area of the nonpolar molecules. 

Such a concept is difficult to employ directly within a numerical simulation, for 
these calculations rely upon knowing the interactions between point particles, rather 
than between surface areas. Since the presence of carbon atoms is generally the defin- 
ing feature of nonpolar solutes, it is therefore sensible to define the carbon atoms in 
hydrophobic residues as the centers of the hydrophobic interaction. The consistency 
of this approach can be tested simply by plotting the transfer free energy of hydropho- 
bic amino acids between polar and nonpolar solvents as a function of the number of 
side-chain carbon atoms. If the hydrophobic effect is essentially produced by car- 
bon atoms, and if, in addition, all carbon atoms in hydrophobic residues contribute 
roughly equally, a linear plot will result. This result is also implied if the carbon atoms 
are taken as hard spheres of excluded volume which interact with water independently 
of one another. 
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A canonical example is shown in Fig. 1. In his classic analysis of the hydrophobic 
effect, Chothia '^^1 plotted transfer data of Nozaki and Tanford '^^^ versus the accessible 
surface area. The data could be described by two lines, for nonpolar and somewhat 
polar residues, respectively. By contrast, a plot of the same data versus the number 
of side-chain carbon atoms suggests a fit to a single line. 

Other hydrophobicity data I^^^^i] ^qj-q Q]gQ analysed in this fashion. It was found 
that the data were reasonably described by a linear fit to either the accessible area or 
the number of side-chain carbon atoms. For long alkane chains, the two descriptions 
are essentially identical. The slope of the linear fit of free energy versus number 
of side-chain carbons, AV, ranges from 300 cal/mol-carbon to 700 cal/mol-carbon, 
depending upon the specific points used and the estimated errors. In this work, the 
midpoint value 

AV — 500 cal/mol ■ carbon (1) 

is used. 

Experimental work on the hydrophobic force suggests that it decays exponentially 
with distance. For example, in their ground-breaking study Israelachvili and Pashley 
[^^1 reported an exponential force with a decay length of 10 A. (This value for the decay 
length was subsequently employed in the folding study of myoglobin by Saito et.al. 
described above f^^.) It is important to note, however, that this experiment never 
directly observed an attractive force. Instead, a theoretical result was subtracted 
from the observed repulsion to estimate the attractive hydrophobic effect. Later 
experimental work [23-24] reported decay lengths of 20 to 160 A. A critical discussion 
of this work was given by Podgornik and Parsegian l^^l . 

It seems unlikely that an interaction whose decay length is of the order of 100 A 
could be relevant for folding myoglobin, which is essentially an oblate spheroid, 44 x 44 
X 25 A . A sensible question to ask is what the appropriate decay length should be 
for the hydrophobic interaction which folds myoglobin. Presumably, the hydrophobic 
effect derives from the local disruption of a partially-ordered water structure. An 
estimate of the decay length should then be possible from the properties of liquid 
water alone P7-3i]_ fjj^g average distance between molecules in ice is 2.8 A, which 
also approximates the location of the first peak in the radial distribution function 
g(r) for hquid water. Estimates '^^1 of the thickness of the hydration shell around 
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typical proteins also yield 2.8 A. Moreover, solvent-accessible area of nonpolar solutes 
is typically determined '^^^ by using a sphere of diameter 2.8 A. It is, therefore, 
reasonable to take the value A = 2.8 A assumed here for the decay length of the 
hydrophobic force. 

For the reasons discussed above, carbon atoms in hydrophobic residues are taken 
as the centers of the hydrophobic interaction. Two carbon atoms in separate hy- 
drophobic residues will then contribute an amount 

Vnir) = VH{0)exp{-r/X) (2) 

to the free energy. Here, r represents the distance between the carbon atoms. It 
remains to specify the interaction strength V//(0). Recall from Eq. (1) that the typical 
free energy of transfer of a hydrophobic carbon from a nonpolar solvent to a polar 
solvent was estimated as AV = 500 cal/mol. This may also be taken as the amount 
by which the free energy decreases as a carbon atom approaches a hydrophobic center. 

The parameter Vh{0) can then be fixed as follows: The Lcnnard- Jones interaction 
between two atoms can be specified in terms of its value at its minimum 

VLj{r)^E^[-C-^Y' + 2C-^f] (3) 

The parameterization of Levitt ^^^1 yields the values = —0.19 kcal/mol at r„j = 
3.53 A between aliphatic carbon atoms. The minimum of the potential VLj(r) + Vi^(r) 
should be less than E^ by an amount AV [see Fig. 2a] . This requirement determines 
Vh{0) — —1.65 kcal/mol, which completes the specification of the hydrophobic force. 

The model proposed in Eq. (2) and the physical arguments leading to the choice 
of parameters used can also be justified independently by making a comparison with 
the microscopic theory proposed by Pratt and Chandler t^^l . The theory of Pratt and 
Chandler involves microscopically well-defined approximations based upon a hard- 
sphere treatment of fluids, and is therefore philosophically consonant with the present 
model (see the discussion of Fig. 1 above). Moreover, it agrees remarkably well 
quantitatively with subsequent computer simulations using realistic models of water 
(see, e.g., P5-37] g^j^^j references cited therein). 

The model of Pratt and Chandler for a hard-sphere radius of 3.5 A (which is 
the minimum of the Lennard- Jones potential given above) at 25°C is compared with 
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the present model in Fig. 2b. The two potentials clearly agree well in terms of 
magnitude and range. One significant difference does, however, exist. Superimposed 
upon the anticipated exponential decay are a series of potential energy barriers which 
oscillate with a period of roughly 3 A, essentially equal to the hard-sphere diameter 
of water. These barriers arise from the existence of water molecules between the 
separated hydrophobic pair. As the hydrophobic molecules approach one another, the 
intervening hydration structures will be altered, and the water molecules between the 
hydrophobic molecules will be forced away. The effective potential which determines 
the interaction must, therefore, change as the hydrophobic molecules approach one 
another, and the relevant energy barrier will disappear. Dynamical effects of this 
nature presumably occur on a picosecond time scale, while for the present study 
(which uses a step size larger than ten picoseconds) only processes on the level of 
nanoseconds to microseconds are relevant. The effective hydrophobic potential model 
Eq. (2) is intended to represent a long-time average of the entropic solvent forces 
active during folding, and so the short-time energy barriers are omitted. This is 
probably about as good as one can do at present without further assumptions or 
some remnant of dynamic water molecules. (This matter is discussed further in the 
concluding section). 

It is well to note one final point. The model of Pratt and Chandler is valid in the 
limit of a dilute solute. Moreover, the present approach treats each hydrophobic atom 
as acting independently of other hydrophobic atoms. This would not be the case if 
significant collective hydrophobic effects were involved. However, the transfer free 
energy of pure hydrocarbon chains is linearly proportional to the number of carbon 
atoms in the chain '^^1, and the free energy of transfer is generally considered to be 
linearly proportional to solvent-accesible surface area, implying that collective effects 
(which would produce nonlinear behavior) are insignificant. 

b. Definition of the potential model. 

The protein studied here is sperm-whale myoglobin (IMBN), with coordinates 
taken from the Brookhaven Data Bank '^^l The protein is separated into the tra- 
ditional 8 hehces, with residue assignments as follows: ^4(3-18), (20-35), C(36-42), 
^(51-57), ^(58- 77), F(86-94), ^(100-118), and i/(124-149). Each helix is viewed as 
a fundamental unit. All atoms in each helix arc moved as a unit, and only forces 
between atoms in different helices are considered. A "spring" connects the final al- 
pha carbon of each helix with the initial alpha carbon of the next helix. The spring 
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potential is taken to be 



Vspring = f{r-ro) kcttl / mol 
f{x) = 5x'^ x>0 

= Ix^ x<0 (4) 

where Tq is set to 3.80+1.35Nres angstroms, with N^es the number of intervening amino 
acid residues. As can be seen in Fig. 3, this potential allows about an angstrom of 
free motion about its minimum, consistent with the notion that the spring represents 
a segment of random coil configuration. 

The computation time required to determine forces between N particles is roughly 
proportional to N'^. It is therefore important to eliminate interactions whose contri- 
bution to the free energy is small. In the parameterization of Levitt t^^l used above, 
the interactions between carbon atoms provide the largest contributions by far. Ac- 
cordingly, only interactions between helical carbon atoms are included, with the pa- 
rameters above used throughout. Lennard- Jones forces between all helical carbon 
atoms are included. Hydrophobic forces [from Eq. (2)] between helical carbon atoms 
in the residues, TRP, PHE, TYR, ILE, LEU, VAL, and MET are also included. The 
heme group is considered as part of the F helix, and all of its carbon atoms are taken 
as hydrophobic. 

Some electrostatic effects are included implicitly by considering the hydrophobic 
effect to be operative only between the residues in the above set. (This is equivalent 
to assuming that, for large-scale folding, the dominant contribution of electrostatic 
effects is to reduce or eliminate hydrophobicity) . The contribution of main-chain 
carbon atoms to the folding pattern was also found to be neghgible. 

c. Solving the model: Brownian dynamics and the Langevin equation. 

Numerical simulations based upon the Brownian dynamics of a Langevin equa- 
tion are indicated when motions on widely separated timescales are involved P^'^^l. 
A protein in a water solvent is a prime example of this phenomenon. Large num- 
bers of rapidly moving solvent molecules are present and slowly cause the protein to 
fold. Present computational means are insufficient for the inclusion of explicit solvent 
molecules, which are inherently of no direct interest. It is, therefore, appropriate to 
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approximate water solvent effects by the Langevin formalism in conjunction with the 
hydrophobic forces defined above. 

The Langevin formalism is also appropriate for another reason. A system which 
evolves by Brownian dynamics will, in general, achieve thermal equilibrium I^^'^^l. In 
thermal equilibrium, the most probable state corresponds to the global minimum of 
the free energy, which here is the native state of the protein. The model protein will, 
therefore, always reach its native state. Langevin dynamics thus possesses a distinct 
advantage over methods like gradient minimization or energy-conserving molecular 
dynamics, neither of which is guaranteed to find the native state. Moreover, the 
physical derivation of the formalism suggests that it will reach the native state in a 
fashion similar to the way a real protein folds in vitro. 

In the Langevin approach, solvent effects are mimicked by the inclusion of random 
thermal forces and viscous friction l^^l . For a single particle, the Langevin equation is 

= F - av + F'(t) , (5) 

at 

where m and v arc the particle mass and velocity, F is the external force as derived 
from a potential, and F' is the fluctuating thermal force. For a spherical particle of 
radius a in a liquid of viscosity rj, Stokes' law yields a = GTirja. The random force F' 
is specified by its ensemble averages: 



< F'{t) > = 
< Fi{ti)Fl{t2) > = 2dab kT a d{t, - ^2) (6) 



where T is the temperature. 



Here the Langevin approach is employed to generate representative trajectories 
for the eight alpha helices of myoglobin. Each helix is treated as a rigid body, with 
three translational and three rotational degrees of freedom. Great simplification of the 
Langevin equations results if atoms (except hydrogen, which is ignored) are considered 
to be identical spheres, with equal average mass and viscosity. (This approximation 
should only affect the dynamics slightly and has no effect on the location of the 
potential minimum). Then the Langevin equation for the center-of-mass coordinate 
of each hehx is formally identical to Eq. (5), with a suitable redefinition of variables. 
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The rotational kinematics of the hehces is also then straightforward. Each helix 
possesses an angular momentum L which evolves in time according to 



in the space-fixed system. Here, r is the torque on the helix arising from the potential- 
derived force, and r' is the torque from random thermal (solvent) motion. The 
ensemble averages corresponding to Eq. (6) are easily seen to be 



< r' > = 

<ra{tiK{h)> = 2kT - 4, S{h-h) (8) 

where /„& is the inertia tensor for particles of equal mass m. 

The translational degrees of freedom are easily integrated using a velocity Verlet 
algorithm l^^l , but the rotational degrees of freedom require special techniques based 
upon quaternionic methods '^^'^^1 . The solvent-induced fluctuating thermal forces are 
calculated at each time step by a Gaussian random number generator with appropriate 
variance. 

The usual checks of energy conservation (in the absence of thermal forces and 
friction) and equipartition were made. The time step size was continuously varied 
according to the convergence of the algorithm, with a minimum step size of about 
ten picoseconds. The longest time scale which could be considered is of the order of 
hundreds of microseconds. Although this is much larger than the 130 picosecond time 
scales considered in classic molecular dynamics simulations '^1, it is still insufficient 
to probe true protein folding scales, which are in the domain of seconds to minutes. 
An additional artifice was therefore employed. The temperature of the system was 
initially set to several thousand degrees, and it was gradually cooled to 25°C. Energy 
barrier penetration thus occurred at a much faster rate than at normal tempera- 
tures, and the folded state was reached fairly quickly. This heating/cooling scheme 
resembles Monte Carlo "simulated annealing" methods in common use, but differs in 
that a realistic continuous folding trajectory is produced here. This trajectory might 
therefore be similar to the true refolding pathway of the protein. 

3. Results: A time series of folding myoglobin. 
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Figure 4a shows a plot of the kinetic and potential energy of the model as it 
evolves through a typical folding sequence. Two lines are plotted, one depicting the 
sum of the hydrophobic and Lennard- Jones potential terms, and one which includes 
in addition the helix kinetic energy. Five points are singled out, beginning with the 
initial configuration A and continuing to the final point E. The protein configurations 
corresponding to these points are shown in Figs. 5a through 5e, respectively. Figure 
4b shows the corresponding rms deviation (defined below), and the radius of gyration 
(using equal masses for all helical atoms). Both are plotted as a function of time. 

In this sequence, the initial unfolded state undergoes a rapid collapse to a folded 
state near the native conformation. Initially, the temperature was set to 2000"C; 
the system was cooled to 25°C over the course of 300 nanoseconds. The viscosity 
parameter was adjusted to ensure optimum convergence, resulting in a thermal relax- 
ation time of about 100 picoseconds. This run required about 10 hours on a Silicon 
Graphics R4000 Iris Indigo. At the end of the run, the system was reheated and 
recooled six times (after the fashion of simulated annealing). The protocol was as 
before, with the state of lowest energy saved at the end of each run. The protein 
settled to what appears to be the unique ground state, with the combined potential 
from Lennard- Jones and hydrophobic effects equal to -1000 kcal/mol. This state is 
drawn in Fig. 6a. For comparison, the native protein is shown in Fig. 6b. The 
rms difference between the two (taken over all helical nonhydrogen atoms) is 5.95 A. 
The specific definition I^^'^^l of rms difference is that used by Cohen, Richmond, and 
Richards W in their classic work discussed above: 

I o 9 ^ 

'^^-A' (,.-l)(n-2) Sg<'g'-''g')^'^ 

wherein consideration of helix pairing sites leads to a set of 20 possible structures for 
myoglobin. Of this set, the structure which most resembles native myoglobin has an 
rms deviation of 3.62 A from the native structure. It must be pointed out, however, 
that this structure was selected from the final set of 20 by visual inspection, rather 
than by their helix pairing criteria. The worst rms deviation of their 20 proposed 
structures was 7.6 A. 

An explicit comparison of the model and native structures is shown in Fig. 7. 
Distances between alpha carbons were determined for all helical residues in the model 
and native conformations. Contoured areas indicate distances of less than 13 A. The 
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hydrophobic hehx-hehx intersection points (BE, BG, AH, FH, and GH) used in [4] to 
derive the structure of myoglobin are exphcitly indicated. Except for the relatively 
unimportant FH contact, agreement is fairly good. (It should be pointed out that 
the CD region is fairly nonhelical, and, as expected, produces the main difference 
between the model and native structures.) 

In order to search for other states of lower energy, five hundred other runs with 
very different starting configurations were performed. No states of lower energy than 
that in Fig. 6a were found, and the minimum energy configuration appears to be 
unique. Typically, there was an initial rapid inertial collapse, followed by a subse- 
quent (slow) stochastic refinement of structure. This behavior has also been observed 
in lattice models of protein folding t'^^' . Often, during the initial collapse phase, a helix 
becomes trapped between two other helices, resulting in a collapsed state of higher 
potential energy than the minimum. This defect generally disappears during subse- 
quent annealing runs, and the system then relaxes to the true minimum. Interestingly 
enough, the inertial collapse often leads to a structure which has almost the correct 
topology of the native state. During the collapse, the protein dynamically tunnels 
through energy barriers, effectively making "choices" about which folding pattern to 
adopt. Some of these choices are quite poor, and lead to helices being trapped by 
other helices. One example of this behavior is shown in Fig. 5e, where the F helix 
is caught behind the H helix, and so encounters an energy barrier on the way to the 
folded state. As can be seen from Fig. 4a, this state is higher in energy than the 
minimum state by roughly 200 kcal/mol. 

The in vitro folding of myoglobin is not yet completely understood, so a detailed 
comparison of the folding pathway shown here with experiment is impossible. How- 
ever, some results have been reported. When the heme group is removed from native 
myoglobin to produce apomyoglobin, the native state is destabilized, and a stable 
intermediate appears in low pH- induced folding t^^l. The detailed three-dimensional 
structure of apoMb is not known, but it has been proposed [i3,49-5i] -j-j^g^^ ^^le A, G, 
and H helices fold to a structure of moderate stability which constitutes an interme- 
diate state of the pathway of apoMb folding. Further work on apoMb '^^1 finds that 
stabilized secondary structure appears in the A, G, and H helices as a compact folding 
intermediate in less than 5 ms. By contrast, refolding experiments on intact native 
myoglobin t^^^^l find no evidence for a stable folding intermediate. It is suggestive 
to compare the time series displayed in Figs. 5 (note especially Fig. 5d), where the 
AGH triplet reaches the topology of the native state considerably in advance of the 
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B-E subdomain. Nevertheless, the work reported in this paper concerns only intact 
native myoglobin, so such a direct comparison with apoMb, though encouraging, is 
tenuous. 

It has also been found ^^^1 that the H helix of intact myoglobin spontaneously 
populates helical conformations irrespective of the state of folding of the remainder 
of the polypeptide chain. These authors suggest that this helix can be considered an 
autonomous folding unit. Additionally, a G helix peptide segment, though unfolded 
in aqueous solution t^^l, forms an ordered helix in TFE (2,2,2-trifluroethanol) [53,54]_ 
It was proposed that in this case TFE might model the effects of stabilizing tertiary 
contacts. 

Taken as a whole, these data are consonant with the philosophy that myoglobin 
folds by the coalescence of nascent metastable substructures (e.g., "molten glob- 
ules" [^^'^^]). These subunits do not possess a fixed spatial conformation, but do 
on average exhibit a high degree of secondary structure. As folding progresses, the 
substructures are stabilized by tertiary interactions and presumably become the fa- 
mihar alpha hehces. Most Hkely, the effects of this partition of configuration space 
manifest themselves early in the folding process. The experimental results reviewed 
above suggest that this reduction of the relevant degrees of freedom in the system is 
completed in time scales of less than a few milliseconds. It may thus be a reasonable 
approximation to model the nascent subunits by native alpha helices. 

One important test of this philosophy can be performed in the present model. If 
each helix in the model forms a realistic approximation to a nascent subunit, then 
fluctuations expected to occur during the folding process should not cause a signif- 
icant change in the final folded structure. An additional check of the model was 
therefore made. In a real protein, the amino acid side-chains are in constant mo- 
tion. Presumably, they attain their native configurations only after the initial rapid 
collapse. It follows that an initial state with a different configuration of rotamers 
should fold to nearly the same state as obtained with native rotamer helices. This 
expectation was tested by replacing each amino acid side-chain by the most probable 
rotamer state and repeating the calculation. The resulting minimum is displayed in 
Fig. 8. As can be seen, only slight differences result. This state has Lennard- Jones 
plus hydrophobic potential energy equal to -933 kcal/mol, and an rms difference of 
6.28 A from the native conformation, using the same definition I^^'^'''^] as before. The 
"rotamer" configuration has an rms difference of 5.56 A from the model configuration 
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Fig. 6a. 



The model can be tested further by applying it to the protein leghemoglobin. 
Leghemoglobin is a protein of 153 residues, of which only 10 are common to the 153 
residues of sperm-whale myoglobin. The biological origin of leghemoglobin, a single- 
chain plant protein which binds oxygen in legume root nodules, is also very different 
from that of sperm-whale myoglobin. Nevertheless, the native structure of these two 
proteins is noticeably similar. Application of the folding model outlined above to 
leghemoglobin accordingly provides an unusually stringent test of the above model, 
for it is probably the most dissimilar protein one can find which possesses a closely 
related folding pattern. 

The leghemoglobin coordinates '^^1 used for this test (2LH7) were obtained from 
the Brookhaven Data Bank. The protein was separated into the traditional 8 helices, 
with residue assignments as follows: A(4-19), 5(21-36), C(37-43), £'(52-58), E{59- 
78), F(87-98), ^(104-120), and i/(127-153). Other details of the potential model 
were as before. The unique minimum energy configuration is displayed in Fig. 9a, 
and the native state in Fig. 9b. The minimum energy configuration had an energy of 
-982 kcal/mol, and an rms deviation of 5.14 A from the native structure. The EFGH 
region of the minimum energy configuration is quite reasonable, and the topology of 
the globin fold is manifest. In a fashion similar to the case of sperm-whale myoglobin, 
the partly helical CD region is somewhat poorly represented. As discussed further 
in the next section, the model also overestimates the attractive interaction between 
the heme group and the pair of PHE residues 29 and 30 in the center of the B helix. 
This causes the B helix to be pulled in closely to the heme group and, in turn, forces 
an additional misalignment of the A helix. Overall, however, the helices do fold to 
roughly the proper locations, and so the desired "sketchbook" accuracy is obtained. 

4. Conclusions. 

Protein folding presents a challenge of monumental proportions. Despite the large 
amount of attention this problem has received, the physical principles which deter- 
mine global folding patterns are largely unknown. Although sophisticated pattern 
recognition techniques may be useful for an initial survey of this unmapped terri- 
tory, true understanding must be based upon the development and testing of physical 
models of large-scale folding. 
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A small step in this direction was taken here. The fundamental philosophy was 
that globin folding proceeds by the coalescence of nascent substructures, approxi- 
mated here as native alpha helices. It was additionally assumed that solvent effects 
determine the large-scale folding of these proteins. Accordingly, a quantitative model 
of hydrophobic forces was developed from physical considerations of transfer free- 
energy data and shown to agree quantitatively with microscopic results from hard- 
sphere models. The model was employed here in conjunction with explicit (Lennard- 
Jones) van der Waals forces. Some electrostatic effects were included imphcitly as 
well. Simulation of folding was performed by Langevin techniques crafted to repro- 
duce closely the kinetic effects of solvent molecules, while maintaining close contact 
with classical molecular dynamics. 

The model was applied to two proteins of radically different amino-acid sequence 
and biological origin: sperm-whale myoglobin and leghemoglobin. In each case, a 
unique minimum was found and was shown to correspond well with the topology of 
the native structure of the protein. The model is very economical, and thus allows an 
extensive search of configuration space to be performed. In order to increase the speed 
of computation, all carbon atoms were treated equally. This is somewhat unrealistic, 
for the hydrophobic interaction produced by a carbon atom in an aromatic ring must 
be different from that produced by a carbon atom in an alkane chain. Thus, the 
hydrophobic effective potential of PHE is overestimated, while that of LEU and ILE 
is underestimated. Nevertheless, the gross stuctures obtained do, in fact, have the 
proper topology-the helices appear in their proper relative locations. This success 
implies that many of the physical principles relevant for large-scale folding have been 
accurately captured in the model. 

Once a structure with "sketchbook" accuracy is produced, it is of course possible 
to include more precise interactions for a further refinement of the prediction. This 
should be computationally feasible when folding is nearly complete and the gross 
structure of the protein has been obtained. The present model, however, does allow 
the large-scale folding process to be observed directly, which should be useful for 
further design of physical models and algorithmic developement. In particular, the 
model exhibited a number of interesting phenomena supportive of the concept '^^1 
that protein folding is not simply a uniform collapse, simultaneous on all length 
scales. This may be heartening news for the protein folding problem in general, for it 
suggests that not all interactions are relevant at all times during the folding process. 
It may therefore be appropriate to address protein folding by developing a hierarchy of 
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"effective" interactions between subunits of various sizes. It is, in a sense, equivalent 
to determining tlie correct "language" for the description of protein folding (one does 
not generally use subatomic physics to describe chemistry) . Significant simplification 
could then well occur. 

The physical principles which undcrly this point are quite clear, and can be il- 
lustrated by an elementary paradigm. Consider an initial configuration of the model 
where the helices are all ahgned in a straight hne. All eight hehces will experience 
torques which cause them to rotate toward the configuration of least energy. The A 
and H helices are initially the least constrained, for each of the remaining helices is 
connected at both ends to other helices by a "string" of random coil. It is energet- 
ically most favorable either to move only these two helices, or else to move groups 
of hehces which include either the A or H hehx. (This argument also suggests that 
refolding for an alpha-helix protein like myoglobin should begin at the ends of the 
molecule, which might partly explain the early importance of the A, G, and H helices 
in refolding processes I^^^^^l). As a consequence, typically groups of several helices 
must move together as subunits. Collective coordinates even larger than alpha he- 
lices therefore are important role at certain stages of the folding process, and their 
appearance offers additional validation of the idea that the identification of relevant 
large-scale degrees of freedom (such as the alpha helices taken here as a basis set) is 
important in protein folding. If this spontaneous reduction of the degreees of freedom 
in the system is a general physical principle, it could provide one possible resolution 
to the Levinthal paradox 1^1 mentioned in the Introduction. Thus, if the number of 
physically relevant degrees of freedom in the system is far fewer than those present 
in the entire configuration space, a considerably faster search for the folded state is 
implied. This simplification is also useful computationally, for it suggests that the 
speed of folding algorithms may generally be improved by the inclusion of moves of 
collective coordinates, of which nascent alpha helices are only one example. 

Another useful point is the clear existence t^^^ of two stages in folding-an inertial 
collapse phase followed by stochastic refinement of structure. This separation sug- 
gests that substantial inertial effects should be included in a computationally effective 
protein folding algorithm. Indeed, when the present Langevin methodology (which 
includes dynamical inertial effects) was compared with standard Monte Carlo sim- 
ulated annealing, it was found to be quite superior in performance. This difference 
in performance persisted even when the step size for individual Monte Carlo moves 
was allowed to vary dynamically so as to ensure an acceptance rate of about one 
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half, in accordance with standard techniques of simulated annealing. The need for 
alternatives to pure canonical ensemble Monte Carlo methods in molecular simula- 
tions [59'60.35] when short-range forces lead to large-scale collective effects ^^^^ is 
well-known. 

Since the present algorithm is based upon classical molecular dynamics, it uses a 
real physical time coordinate. Thus the time taken for various subprocesses can be 
estimated from observation of the folding process. Observation times are computa- 
tionally limited at present to microsecond time scales, which requires unphysically 
high temperatures to be used in the slow "stochastic" second phase of folding. How- 
ever, the initial rapid inertial collapse was generally visible, even at room temperature. 
This suggests that the initial collapse phase typically occurs in a microsecond time 
scale, which is also the time scale at which alpha helices form and disappear [62-63] 
Therefore the inertial collapse occurs at the same time that the alpha helices them- 
selves are becoming stable. One implication of this result is that alpha helices become 
a significant part of the protein folding pathway before they are actually stable en- 
tities. As has been pointed out previously '^^1, there is no meaningful threshold of 
stabihty for alpha helices to become relevant in folding. 

One of the virtues of the present model is its economy, which permits an extensive 
search of configuration space to be made with present computational architecture. 
Although more detailed models of the effective hydrophobic interaction can be con- 
structed (see, e.g., ^^^l), their complexity is such that the accessible time scales are 
typically of the order of hundreds of picoseconds. Explicit inclusion of water solvent 
molecules reduces this accessible timeframe further, by another factor of twenty '^^1. 
The philosophy presented here allows effects whose natural timescale is of the order of 
tens to hundreds of microseconds to be considered, albeit with generally less precision. 
The accessible timescale is, however, increased by a factor larger than a million. 

A natural question to ask, therefore, is whether the model can be simplified fur- 
ther. One attempt in this direction was tried. The attractive portion of the Lennard- 
Jones potential was removed, and the coefficient of the hard-core repulsion was varied 
to give different values of the minimum hard-sphere distance of the potential (which 
is then the only parameter remaining, aside from the size of water used as the decay 
length). The resulting model is then probably as simple as can be conceived, for a 
repulsive part of the potential is required to prevent the helices from collapsing into 
one another, while an attractive force is needed to drive the folding process. Rea- 
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sonable folded structures could be obtained when this minimum value was taken to 
be 4.0 A, which occurred when the repulsive part of the potential was the same as 
employed in the full model. The rms value so obtained was 6.07 A, to be compared 
with 6.59 A and 6.61 A reached for potential minima of 3.5 A and 4.5 A respectively. 
Values of the minimum significantly less than 4.0 A led to overly compact structures, 
while larger values produced structures which were too extended. 

The fact that somewhat plausible structures can be obtained when the only force 
driving folding is the hydrophobic effect provides quantitative confirmation of the 
idea that it is this effect which determines large-scale globin folding patterns. There 
is a philosophical inconsistency in removing the attractive part of the Lcnnard- Jones 
potential, however, for the hard-sphere radius of a molecule is typically taken as the 
location of the minimum of this potential. When the attractive part of the Lennard- 
Jones force is eliminated, the Lennard- Jones potential then has no minimum, and 
the hard-core radius is undefined. To prevent collapse, the attractive hydrophobic 
force must be balanced by an arbitrarily chosen hard-core repulsion. The hard-core 
repulsion then must be parameterized and determined by observing the folding of 
myoglobin or some similar process. By contrast, the hydrophobic potential model 
applied in this work was derived from physical arguments, independently of knowledge 
of the structure of myoglobin. The recognizable structure so obtained accordingly 
suggests that these arguments have a fair degree of validity. 

In summary, then, the purpose of this work is not to attain the most precise 
depiction of folding possible, but rather to attempt to learn the most use/u/ language in 
which to describe the process. With few exceptions, all useful descriptions of complex 
systems involve judicious approximations. So it is here. A careful attempt was made 
to develop an accurate physical model of the most important effects which determine 
folding patterns. Rather than taking an intricate potential function and determining 
its parameters empirically (ultimately equivalent to a form of pattern recognition), 
the goal of logical simplicity was sought throughout. The result is a tractable model 
of protein folding based upon sound physical principles which, nevertheless, captures 
many of the essential features of more detailed models. Extension of the present 
formalism to other proteins should, therefore, be of some utility in the systematic 
extraction of the physical principles which determine large-scale folding patterns. 
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Figure Captions. 



Fig. la. 

Plot of accessible surface area (A^) versus hydrophobicity (kcal/mol) for various 
amino acid residues. Lines of slope 26 cal/mol • and 22 cal/mol- A^ are also shown. 
Adapted from Chothia [16]. 

Fig. lb. 

The same data [17] as Fig. lb plotted versus the number of carbon atoms in the 
residue side-chain. The hue has a slope of 321 cal/mol • carbon. 

Fig. 2a. 

Solid line displays Lennard- Jones potential VLj{r) from Eq. (3). Dashed line is 
sum of VLj{r) and Vff(r). Potential minima differ by 500 cal/mol (see text). 

Fig. 2b. 

Solid hne shows the potential of mean force derived by Pratt and Chandler t^^l for 
a hard-sphere radius of 3.5 A. Dashed line shows the hydrophobic effective potential 
Vff(r) proposed here. Energy is plotted in units of RT at 25°C, distance in A. 

Fig. 3. 

Potential function f{x) (kcal/mol) versus x (A) from Eq. 4. 
Fig. 4a. 

Lower line is a plot of the sum of the hydrophobic and Lennard- Jones potential 
terms (kcal/mol) given in the text versus time (nanoseconds). Upper line includes 
helix kinetic energy. In addition to the initial state A and the final state E, the points 
B, C, and D are labelled. 

Fig. 4b. 

Plots of rms distance from the native protein (as defined in the text) and radius of 
gyration versus time (nanoseconds), corresponding to Fig. 4a. The plots are labelled 
"rms" and "gyr" respectively. 
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Figs. 5a-e. 

Myoglobin configurations corresponding to points labelled A through E in Fig. 4. 
Helices are drawn as cylinders along major axes of inertia tensors. Heme unit carbon 
atoms drawn as spheres, central (larger) atom is iron. Helix and atomic radii not to 
scale. 

Fig. 6a. 

Unique folded configuration produced by the model, drawn as per Figs. 5. 
Fig. 6b. 

Native myoglobin, drawn as per Figs. 5. 
Fig. 7. 

Plot of distances between alpha carbons of helical residues. Axes indicate residue 
numbers. Contoured areas indicate distances less than 13 A. Lower half is native 
conformation (Fig. 6b), upper half is model (Fig. 6a). Hydrophobic contact zones 
(BG, etc.) of Ref. [4] labelled as shown. 

Fig. 8. 

Folded "rotamer" state, drawn as per Figs. 5. 
Fig. 9a. 

Unique folded configuration of leghemoglobin produced by the model, drawn as 
per Figs. 5. 

Fig. 9b. 

Native leghemoglobin, drawn as per Figs. 5. 
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